Implement Shard and CoordinatorLog metadata durability #4360

iamaleksey · 2025-09-03T16:02:24Z

patch by Aleksey Yeschenko; reviewed by Abe Ratnofsky for CASSANDRA-20882

src/java/org/apache/cassandra/replication/Node2OffsetsMap.java

src/java/org/apache/cassandra/replication/Shard.java

src/java/org/apache/cassandra/replication/MutationTrackingService.java

src/java/org/apache/cassandra/replication/Shard.java

aratno · 2025-09-08T18:36:58Z

src/java/org/apache/cassandra/replication/CoordinatorLog.java

+            localWitnessed = Offsets.Mutable.copy(witnessedOffsets.get(localNodeId));
+
+            witnessedOffsets.convertToPrimitiveMap(witnessed);
+            persistedOffsets.convertToPrimitiveMap(persisted);


Would be good to make this incremental in the future, since every witnessed offset will eventually be persisted and reconciled

I'm not sure about this one. We must overwrite the entire value of each list in the system table because of the representation, but that shouldn't be a big deal: the size of each list should stay roughly the same, namely the head, with no gaps, collapsed into one range, plus the slightly sparse tail. I don't see how to make it incremental in this context, or what it'd bring. I could easily be missing something, though.

Right, I wasn't thinking about the frozen representation at the time. Makes sense as-is for now.

persistedOffsets could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.
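The size argument in this thread can be illustrated with a minimal sketch. It is not the actual `Offsets` implementation; the class and method names (`OffsetRangesSketch`, `collapse`, `format`) are hypothetical. It only shows how a gap-free head collapses into a single range while a sparse tail adds one range per isolated offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Cassandra's Offsets class.
class OffsetRangesSketch
{
    /** Collapses sorted, de-duplicated offsets into inclusive {start, end} ranges. */
    static List<int[]> collapse(int[] sortedOffsets)
    {
        List<int[]> ranges = new ArrayList<>();
        if (sortedOffsets.length == 0)
            return ranges;

        int start = sortedOffsets[0];
        int end = start;
        for (int i = 1; i < sortedOffsets.length; i++)
        {
            int offset = sortedOffsets[i];
            if (offset == end + 1)
            {
                end = offset; // contiguous: extend the current range
            }
            else
            {
                ranges.add(new int[]{ start, end }); // gap: close the current range
                start = end = offset;
            }
        }
        ranges.add(new int[]{ start, end });
        return ranges;
    }

    /** Renders ranges as "[start,end] [start,end] ...". */
    static String format(List<int[]> ranges)
    {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges)
        {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append('[').append(r[0]).append(',').append(r[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Gap-free head 0..7, then a sparse tail: 9 and 12.
        int[] witnessed = { 0, 1, 2, 3, 4, 5, 6, 7, 9, 12 };
        System.out.println(format(collapse(witnessed))); // prints "[0,7] [9,9] [12,12]"
    }
}
```

Under this model the stored list grows with the number of gaps, not with the number of offsets, which is why rewriting the whole value is cheap while the tail stays only slightly sparse.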

aratno

LGTM. Would be good to have a test extending org.apache.cassandra.fuzz.topology.TopologyMixupTestBase to test bounces as well.

Broadcasting persisted offsets every minute feels infrequent to me. The more frequently we broadcast persisted offsets, the sooner we can mark SSTables as repaired and purge CoordinatorLogOffsets. We want to avoid doing too many compactions on an SSTable before we can mark it repaired, so maybe we determine persistence and broadcast timing based on the number of mutations that have been marked reconciled, broadcasting more frequently when reconciliation is happening quickly.
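One way the suggestion above could look is sketched below. This is an assumption-laden illustration, not anything in the patch: the class name, the clamping bounds, and the target of 10,000 reconciled mutations per broadcast are all hypothetical. The next delay is scaled so that roughly the target number of mutations is reconciled between broadcasts, and clamped so it never exceeds the current one-minute interval.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of reconciliation-driven broadcast timing; every constant is illustrative.
class AdaptiveBroadcastSketch
{
    static final long MIN_DELAY_MS = TimeUnit.SECONDS.toMillis(5);
    static final long MAX_DELAY_MS = TimeUnit.MINUTES.toMillis(1);
    static final long TARGET_RECONCILED_PER_BROADCAST = 10_000;

    /**
     * Picks the next broadcast delay from the number of mutations marked reconciled
     * during the previous interval: faster reconciliation yields a shorter delay.
     */
    static long nextDelayMillis(long reconciledSinceLast, long lastDelayMillis)
    {
        if (reconciledSinceLast <= 0)
            return MAX_DELAY_MS; // nothing reconciled: fall back to the slowest cadence

        // Scale the previous delay so the next interval covers about TARGET mutations.
        double scaled = (double) lastDelayMillis * TARGET_RECONCILED_PER_BROADCAST / reconciledSinceLast;
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, (long) scaled));
    }

    public static void main(String[] args)
    {
        long oneMinute = TimeUnit.MINUTES.toMillis(1);
        System.out.println(nextDelayMillis(60_000, oneMinute));    // busy: 10000
        System.out.println(nextDelayMillis(100, oneMinute));       // quiet: 60000 (clamped to max)
        System.out.println(nextDelayMillis(1_000_000, oneMinute)); // very busy: 5000 (clamped to min)
    }
}
```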

aratno · 2025-09-19T02:41:17Z

src/java/org/apache/cassandra/replication/UnreconciledMutations.java

+            for (int offset = iter.start(), end = iter.end(); offset <= end; offset++)
+            {
+                ShortMutationId id = new ShortMutationId(witnessed.logId, offset);
+                result.addForTesting(MutationJournal.instance.read(id));


Can we avoid calling a *ForTesting method from a production path?

That must've been a botched refactoring rename, clearly unintended. Thanks for catching it.

aratno · 2025-09-19T03:16:35Z

Right, I wasn't thinking about the frozen representation at the time. Makes sense as-is for now.

persistedOffsets could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.

iamaleksey · 2025-09-19T09:34:40Z

> persistedOffsets could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.

I must be missing something, but why would it grow? We aren't recording what's reconciled - we are recording what's been witnessed and written to the system table.

aratno · 2025-09-19T14:55:34Z

Discussed out loud. I had some concerns about our offsets representation causing trouble if we end up with sparse representations, but that's a niche case that will impact more than just the durability paths. We should measure the logical vs. physical size of offsets, or "dropped" mutation IDs, to see if that's a problem for any real workloads. Merge away.
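A rough sketch of the measurement proposed above follows, assuming offsets are held as sorted, non-overlapping inclusive ranges. The names are hypothetical, and treating gaps inside the covered span as a proxy for "dropped" mutation IDs is an assumption made here, not something the thread states.

```java
// Hypothetical sketch: compares how many offsets a range list covers with how many ranges it stores.
class OffsetsDensitySketch
{
    /** Logical size: the number of individual offsets covered by the ranges. */
    static long logicalSize(int[][] ranges)
    {
        long total = 0;
        for (int[] r : ranges)
            total += (long) r[1] - r[0] + 1;
        return total;
    }

    /** Physical size: the number of ranges that must be stored. */
    static int physicalSize(int[][] ranges)
    {
        return ranges.length;
    }

    /** Offsets missing between the first and last covered offset (assumes sorted, non-overlapping ranges). */
    static long gaps(int[][] ranges)
    {
        if (ranges.length == 0)
            return 0;
        long span = (long) ranges[ranges.length - 1][1] - ranges[0][0] + 1;
        return span - logicalSize(ranges);
    }

    public static void main(String[] args)
    {
        int[][] ranges = { { 0, 7 }, { 9, 9 }, { 12, 12 } };
        System.out.println("logical=" + logicalSize(ranges)
                           + " physical=" + physicalSize(ranges)
                           + " gaps=" + gaps(ranges)); // prints "logical=10 physical=3 gaps=3"
    }
}
```

A low ratio of logical to physical size would indicate the sparse case raised in the discussion.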

iamaleksey · 2025-09-19T14:57:50Z

Committed as bd5a657, cheers.

iamaleksey requested a review from aratno September 3, 2025 16:02

iamaleksey self-assigned this Sep 3, 2025

aratno approved these changes Sep 8, 2025

iamaleksey added 5 commits September 11, 2025 14:27

Implement Shard and CoordinatorLog metadata durability

1e5a9b5

patch by Aleksey Yeschenko; reviewed by Abe Ratnofsky for CASSANDRA-20882

Clean up

aa60d6b

Fix TCM/startup

4beb463

Review feedback /1

9c93836

Review feeedback /2

13bd806

iamaleksey force-pushed the 20882 branch from 317762a to 13bd806 Compare September 12, 2025 15:01

iamaleksey requested a review from aratno September 12, 2025 15:02

iamaleksey added 2 commits September 15, 2025 16:13

Load UnreconciledMutations from journal

481a62d

Some Shard tests: system tables and log rotation

4049ff0

aratno approved these changes Sep 19, 2025

iamaleksey closed this Sep 19, 2025
